15th June 2022

Single Cell RNAseq Analysis Workflow

10x overview

Not every droplet is useble

Quality Control overview

  • Aim of QC

    • To remove undetected genes
    • To remove empty droplets
    • To remove droplets with dead cells
    • To remove Doublet/multiplet
  • Above is achieved by …

    • Applying hard cut-off or soft cut-off on …
      • Number of genes detected per cell
      • Percent of mitochondrial genes per cell
      • Number of UMIs/transcripts detected per cell

Quality Control

Bioconductor R packages:

  • scran: Collection functions for interpretation of single-cell RNA-seq data
  • scater: For focus on quality control and visualization.
  • DropletUtils: Handling single-cell (RNA-seq) data from droplet technologies such as 10X Genomics

Orchestrating Single-Cell Analysis with Bioconductor Robert Amezquita, Aaron Lun, Stephanie Hicks, Raphael Gottardo

http://bioconductor.org/books/release/OSCA/

Read CellRanger outputs into R

  • CellRanger outputs: gives two output folders raw and filtered

  • Each folder has three zipped files

    • features.tsv.gz, barcodes.tsv.gz and matrix.mtx.gz
    • raw_feature_bc_matrix
      • All valid barcodes from GEMs captured in the data
      • Contains about half a million to a million barcodes
      • Most barcodes do not actually contain cells
    • filtered_feature_bc_matrix
      • Excludes barcodes that correspond to this background
      • Contains valid cells according to 10x cell calling algorithm
      • Contains 100s to 1000s of barcodes

Single Cell Experiment Vocabulary alert

  • cell = Barcode = droplet
  • Transcript = UMI

The SingleCellExperiment object

The Counts Matrix

To access counts from sce object: counts(sce)

Feature metadata

To access gene metadata from sce object: rowData(sce)

Droplet annotation (Cell metadata)

To access cell metadata from sce object: colData(sce)

Properties of RNAseq data - Number of genes detected per cell

Properties of RNAseq data - Total UMIs

Properties of RNAseq data - Distribution of counts for a gene across cells

Properties of RNAseq data - Distribution of counts for a gene across cells

Remove undetected genes

Although the count matrix has 36601 genes, many of these will not have been detected in any droplet. We can remove these to reduce the size of the count matrix.

undetected_genes <- rowSums(counts(sce)) == 0
sce <- sce[!undetected_genes,]
sce

Quality Control

  • Not all of the droplets called as cells by CellRanger will contain good quality cells
  • Poor quality droplets will adversely affect downstream analysis
  • We can use QC metrics to filter out poor quality droplets:
    • Total UMIs (Library size)
    • Number of gene detected
    • Proportion of UMIs mapping to mitochondrial genes

Quality Control

  • Add gene annotation to identify Mt genes - AnnotationHub
rowData(sce)

Quality Control

is.mito <- which(rowData(sce)$Chromosome=="MT")
sce <- addPerCellQC(sce, subsets = list(Mito = is.mito))

Adds six columns to the droplet annotation:

  • sum: total UMI count
  • detected: number of features (genes) detected
  • subsets_Mito_sum: number of UMIs mapped to mitochondrial transcripts
  • subsets_Mito_detected: number of mitochondrial genes detected
  • subsets_Mito_percent: percentage of UMIs mapped to mitochondrial transcripts
  • total: also the total UMI count

Quality Control

is.mito <- which(rowData(sce)$Chromosome=="MT")
sce <- addPerCellQC(sce, subsets = list(Mito = is.mito))
colData(sce)
## DataFrame with 3094 rows and 8 columns
##                         Sample            Barcode       sum  detected subsets_Mito_sum
##                                      
## AAACCTGAGACTTTCG-1  SRR9264343 AAACCTGAGACTTTCG-1      6677      2056              292
## AAACCTGGTCTTCAAG-1  SRR9264343 AAACCTGGTCTTCAAG-1     12064      3177              575
## AAACCTGGTGCAACTT-1  SRR9264343 AAACCTGGTGCAACTT-1       843       363              428
## AAACCTGGTGTTGAGG-1  SRR9264343 AAACCTGGTGTTGAGG-1      8175      2570              429
## AAACCTGTCCCAAGTA-1  SRR9264343 AAACCTGTCCCAAGTA-1      8638      2389              526
## ...                        ...                ...       ...       ...              ...
## TTTGGTTTCTTTAGGG-1  SRR9264343 TTTGGTTTCTTTAGGG-1      3489      1600              239
## TTTGTCAAGAAACGAG-1  SRR9264343 TTTGTCAAGAAACGAG-1      7809      2415              548
## TTTGTCAAGGACGAAA-1  SRR9264343 TTTGTCAAGGACGAAA-1      9486      2589              503
## TTTGTCACAGGCTCAC-1  SRR9264343 TTTGTCACAGGCTCAC-1      1182       591              224
## TTTGTCAGTTCGGCAC-1  SRR9264343 TTTGTCAGTTCGGCAC-1     10514      2831              484
##                    subsets_Mito_detected subsets_Mito_percent     total
##                                             
## AAACCTGAGACTTTCG-1                    12              4.37322      6677
## AAACCTGGTCTTCAAG-1                    12              4.76625     12064
## AAACCTGGTGCAACTT-1                    11             50.77106       843
## AAACCTGGTGTTGAGG-1                    12              5.24771      8175
## AAACCTGTCCCAAGTA-1                    13              6.08937      8638
## ...                                  ...                  ...       ...
## TTTGGTTTCTTTAGGG-1                    11              6.85010      3489
## TTTGTCAAGAAACGAG-1                    12              7.01754      7809
## TTTGTCAAGGACGAAA-1                    12              5.30255      9486
## TTTGTCACAGGCTCAC-1                    11             18.95093      1182
## TTTGTCAGTTCGGCAC-1                    12              4.60339     10514

QC metrics - distribution

plotColData(sce, x="Sample", y="sum") + scale_y_log10()
plotColData(sce, x="Sample", y="detected") + scale_y_log10()
plotColData(sce, x="Sample", y="subsets_Mito_percent")

QC metrics - relationship

Identification of low-quality cells with adaptive thresholds

sce$low_lib_size <- isOutlier(sce$sum, log=TRUE, type="lower")
sce$low_n_features <- isOutlier(sce$detected, log=TRUE, type="lower")
sce$high_Mito_percent <- isOutlier(sce$subsets_Mito_percent, type="higher") 

All three filter steps at once

cell_qc_results <- quickPerCellQC(colData(sce), percent_subsets=c("subsets_Mito_percent"))

All three filter steps at once

cell_qc_results <- quickPerCellQC(colData(sce), percent_subsets=c("subsets_Mito_percent"))

Filter the Single Cell Object

  • Filter cells according to QC metrics.
sce.Filtered <- sce[, !cell_qc_results$discard]
sce.Filtered